We employ a parameter-free distribution estimation framework where estimators are random
distributions and utilize the Kullback-Leibler (KL) divergence as a loss function. Wu and Vos
[J. Statist. Plann. Inference 142 (2012) 1525-1536] show that when an estimator obtained from
an i.i.d. sample is viewed as a random distribution, the KL risk of the estimator decomposes in
a fashion parallel to the mean squared error decomposition when the estimator is a real-valued
random variable. In this paper, we explore how conditional versions of distribution expectation
($E^\dagger$) can be defined so that a distribution version of the Rao-Blackwell theorem holds. We
define distributional expectation and variance ($V^\dagger$) that also provide a decomposition of KL
risk in exponential and mixture families. For exponential families, we show that the maximum
likelihood estimator (viewed as a random distribution) is distribution unbiased and is the unique
uniformly minimum distribution variance unbiased (UMV$^\dagger$U) estimator. Furthermore, we show
that the MLE is robust against model misspecification in that if the true distribution does not
belong to the exponential family, the MLE is UMV$^\dagger$U for the KL projection of the true
distribution onto the exponential family provided these two distributions have the same
expectation for the canonical statistic. To allow for estimators taking values outside of the
exponential family, we include results for KL projection and define an extended projection to
accommodate the non-existence of the MLE for families having discrete sample space.
Illustrative examples are provided.

Keywords: distribution unbiasedness; extended KL projection; Kullback-Leibler loss; MVUE; 
Pythagorean relationship; Rao-Blackwell 


1. Introduction 

Wu and Vos [13] introduce a parameter-free distribution estimation framework and utilize
the Kullback-Leibler (KL) divergence as a loss function. They show that the KL
risk of a distribution estimator obtained from an i.i.d. sample decomposes in a fashion
parallel to the mean squared error decomposition for a parameter estimator, and that an
estimator is distribution unbiased, or simply unbiased, if and only if its distribution mean
is equal to the true distribution. Distribution unbiasedness can be defined without using
any parameterization. We call this approach parameter-free even though there may be
applications where it is desirable to use a particular parameterization. When the
distributions are, in fact, parametrically indexed, distribution unbiasedness handles multiple
parameters simultaneously and is consistent under reparametrization. Wu and Vos [13]
also show that the MLE for distributions in the exponential family is always distribution
unbiased.

The KL expectation and variance functions $E$ and $V$ are defined by minimizing over
the space of all distributions. These functions completely describe an estimator in terms
of its KL divergence around any distribution. In this paper, we introduce distribution
expectation and variance functions $E^\dagger$ and $V^\dagger$ that are defined by minimizing over a
smaller space of distributions. For exponential and mixture families, the expected KL
risk is a function only of these quantities.

Even though the focus of this paper is on parametric exponential families, our approach 
is parameter-free in that the definitions and results are provided without regard to the 
parameterization of the family. There are three advantages to this approach: one, the 
lack of invariance of bias across parameter transformations is avoided; two, we can allow 
for estimators taking values outside of the exponential family; three, the case where the 
true distribution does not belong to the family is easily addressed. 

Section 2 introduces the distribution expectation and variance functions and shows
how these are a generalization of the mean and variance functions for mean square
error. Exponential families and their extension are discussed in Section 3. The
fundamental properties of the distribution mean and variance functions allow using the ideas
of Rao-Blackwell [2] to show that the MLE is the unique uniformly minimum
distribution variance unbiased estimator (UMV$^\dagger$UE). This result is proved in Section 4. Three
examples are given in Section 5 and Section 6 contains further remarks.


2. Kullback-Leibler risk, variance, and expectation

2.1. Motivation 

The parametric version of the Rao-Blackwell theorem can be proved using a Pythagorean
relationship that holds for mean square error (MSE) and the expectation operator. To
prove the distribution version of the Rao-Blackwell theorem, we use a similar
relationship that holds for KL risk and the KL expectation along with a second Pythagorean
relationship that holds in exponential families for KL divergence and the KL projection.
Basic properties of the expectation operator for real-valued random variables used in
the proof can be extended to distribution-valued random variables. We begin with the
property that the expectation minimizes the MSE.

For (real-valued) random variable $Y$ and $a \in \mathbb{R}$, we can define the average behavior of
$Y$ relative to $a$ using the risk function


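\[ E[d(Y,a)], \]

where $d$ is a loss function, that is, a nonnegative convex function on $\mathbb{R} \times \mathbb{R}$. When
$E[d(Y,a)] < \infty$ for some $a$, we define

\[ V_d Y \stackrel{\mathrm{def}}{=} \inf_{b \in \mathbb{R}} E[d(Y,b)] \]

and

\[ E_d Y \stackrel{\mathrm{def}}{=} \arg\min_{b \in \mathbb{R}} E[d(Y,b)] \]

if the minimum exists, in which case,

\[ V_d Y = E[d(Y, E_d Y)]. \]

When $d(a,b) = L(a,b) = (a - b)^2$, that is, risk is MSE, we have

\[ E_L Y \stackrel{\mathrm{def}}{=} \arg\min_{b \in \mathbb{R}} E[L(Y,b)] = \int y \, dR_0 \stackrel{\mathrm{def}}{=} EY, \tag{2.1} \]

\[ V_L Y \stackrel{\mathrm{def}}{=} \inf_{b \in \mathbb{R}} E[L(Y,b)] = E[L(Y, EY)] \stackrel{\mathrm{def}}{=} VY. \tag{2.2} \]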

Note that we use the loss function as subscript to indicate expectation and variance 
defined in terms of an argmin and infimum of the loss function, while expectations and 
variances without a subscript are defined in terms of an integral, or in terms of a sum if 
the sample space is discrete. The middle equality signs in equations (2.1) and (2.2) are 
well-known results for EY and VY. These two values completely characterize the risk 
because of the relationship 

\[ E[L(Y,a)] = L(E_L Y, a) + V_L Y \quad \forall a \in \mathbb{R}. \tag{2.3} \]

In particular, the MSE for a random variable $Y$ is completely determined by knowing
its expectation $EY$ and variance $VY$. Note that (2.3) holds for any distribution function
such that $EY$ and $VY$ exist. For general loss functions $d$, the argmin $E_d Y$ and min $V_d Y$
do not characterize the risk; that is,

\[ E[d(Y,a)] - d(E_d Y, a) \]


will be a function of a. 
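
As a small numerical illustration of (2.3), the following sketch uses a made-up discrete distribution for $Y$ (the values and probabilities are assumptions chosen only for illustration). It checks that the excess risk $E[L(Y,a)] - L(EY,a)$ is constant in $a$ under squared error, while the analogous quantity under absolute error, with the median as argmin, changes with $a$.

```python
import numpy as np

# Illustrative (hypothetical) discrete distribution for Y.
y = np.array([0.0, 1.0, 4.0])
p = np.array([0.4, 0.4, 0.2])

def risk(loss, a):
    """Risk E[loss(Y, a)] under the distribution (y, p)."""
    return float(np.sum(p * loss(y, a)))

def squared(u, a):
    return (u - a) ** 2

def absolute(u, a):
    return np.abs(u - a)

EY = float(np.sum(p * y))   # argmin of the squared-error risk
VY = risk(squared, EY)      # minimum squared-error risk
median = 1.0                # argmin of the absolute-error risk for this (y, p)

grid = np.array([-1.0, 0.0, 0.5, 2.0, 3.0])
sq_gap = np.array([risk(squared, a) - squared(EY, a) for a in grid])
abs_gap = np.array([risk(absolute, a) - absolute(median, a) for a in grid])

print(sq_gap)    # constant in a, equal to VY
print(abs_gap)   # varies with a
```

Only the squared-error gap is flat across the grid, which is the sense in which $EY$ and $VY$ characterize the MSE.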

The expectation and variance also have the following conditional properties 


\[ EY = E\,E[Y \mid X], \tag{2.4} \]

\[ VY = V\,E[Y \mid X] + E[V(Y \mid X)]. \tag{2.5} \]

In the next section, we consider random variables that take values on a space of
distributions $\mathcal{R}$ and show that when the KL divergence is used to compare distributions,
equations (2.1) through (2.5) hold for KL risk.

2.2. Space of all distributions $\mathcal{R}$

Let $(\mathcal{X}, \mathcal{F})$ be a sample space equipped with a $\sigma$-finite measure $\lambda$. When $\mathcal{X}$ is finite or
countable, $\lambda$ is usually the counting measure. When $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{X}$ contains an open set
of $\mathbb{R}^d$ for some $d = 1, 2, \ldots$, then $\lambda$ is usually the Lebesgue measure on $\mathbb{R}^d$. Requiring $\mathcal{X}$
to contain an open set implies that the dimension of $\mathcal{X}$ is $d$. Let $\mathcal{R}$ be the collection of
all probability measures $R$ on $(\mathcal{X}, \mathcal{F})$ that are absolutely continuous with respect to $\lambda$,
that is, $\lambda(A) = 0$ implies $R(A) = 0$ for all $A \in \mathcal{F}$. This is denoted as $R \ll \lambda$. Note that
we allow the support of $R$ to be a proper subset of $\mathcal{X}$.

Let $\mathbf{R}$ (in bold font) be a random quantity whose values are distributions in $\mathcal{R}$. The
density of the distribution $R$ with respect to $\lambda$ will be denoted by $r$ (in lower case), and
the corresponding random variable by $\mathbf{r}$ (in bold font lower case). Following Definition
2.1 in [13], $\mathbf{R}$ is an $\mathcal{R}$-valued random variable if $\mathbf{R}(A)$ is a real-valued random variable
for all $A \in \mathcal{F}$. We are considering the problem of estimating a distribution, so for this
paper $\mathbf{R} = R_X$ is any estimator of an unknown distribution $R_0 \in \mathcal{R}$ where $X$ is an i.i.d.
sample from $R_0$. A random distribution is a mapping from $\mathcal{X}^n$ to $\mathcal{R}$. Let $S$ be another
random quantity that is jointly distributed with $\mathbf{R}$.

Theorem 2.1. For every $S = s$, $K_s = E[\mathbf{R} \mid S = s]$ is a probability measure that is
absolutely continuous with respect to $\lambda$, that is, $K_s \in \mathcal{R}$, is unique up to measure zero ($\lambda$),
and has a density

\[ k_s(y) = E[\mathbf{r}(y) \mid S = s] \quad \text{for } y \in \mathcal{X}. \tag{2.6} \]

In addition, when $s$ is replaced with the random variable $S$, $K_S = E[\mathbf{R} \mid S]$ is an $\mathcal{R}$-valued
random variable.

Proof. For all $s$ it is easily seen that $K_s$ is a probability measure because $K_s$ is countably
additive and $K_s(\mathcal{X}) = 1 - K_s(\emptyset) = 1$, where $\emptyset$ is the empty set. The remaining claims of
the theorem can be established by noting that equation (2.6) can be written as

\[ k_s(y) = \int r_x(y) \, r_0^n(x \mid s) \, d\lambda^n(x), \tag{2.7} \]

where $r_0^n(x \mid s)$ is the conditional distribution of $x$ given $s$. Since

\[ E[\mathbf{R}(A) \mid s] = \int_{\mathcal{X}^n} \int_A r_x(y) \, d\lambda(y) \, r_0^n(x \mid s) \, d\lambda^n(x), \]

the set $A \in \mathcal{F}$ is arbitrary, and the integrals can be interchanged, we see that $k_s(y)$ is
the density for $K_s$ and $K_s \in \mathcal{R}$ for each $s$ so $K_S$ is an $\mathcal{R}$-valued random variable. □

For $\mathcal{R}$-valued random variable $\mathbf{R}$ and $R \in \mathcal{R}$, we can define the average behavior of $\mathbf{R}$
relative to $R$ using the risk function

\[ E[d(\mathbf{R}, R)], \]

where $d$ is a loss function, that is, a nonnegative convex function on $\mathcal{R} \times \mathcal{R}$. Note that
the expectation used to define the risk is with respect to some distribution $R_0 \in \mathcal{R}$; $R_0$
will be fixed but arbitrary other than constraints to ensure that the quantities in the
expressions below exist and that the support of $R_0$ is $\mathcal{X}$. For any function $d$ such that
$E[d(\mathbf{R}, R)] < \infty$ for some $R$, we define

\[ V_d \mathbf{R} \stackrel{\mathrm{def}}{=} \inf_{R_1 \in \mathcal{R}} E[d(\mathbf{R}, R_1)] \]

and

\[ E_d \mathbf{R} \stackrel{\mathrm{def}}{=} \arg\min_{R_1 \in \mathcal{R}} E[d(\mathbf{R}, R_1)] \]

if the minimum exists, in which case,

\[ V_d \mathbf{R} = E[d(\mathbf{R}, E_d \mathbf{R})]. \]

For KL risk, that is, when $d(R_1, R_2) = D(R_1, R_2) \stackrel{\mathrm{def}}{=} E_{R_1} \log(r_1 / r_2)$, we have

\[ E_D \mathbf{R} \stackrel{\mathrm{def}}{=} \arg\min_{R_1 \in \mathcal{R}} E[D(\mathbf{R}, R_1)] = \int r_x(y) \, r_0^n(x) \, d\lambda^n(x) \stackrel{\mathrm{def}}{=} E\mathbf{R}, \tag{2.8} \]

\[ V_D \mathbf{R} \stackrel{\mathrm{def}}{=} \inf_{R_1 \in \mathcal{R}} E[D(\mathbf{R}, R_1)] = E\,D(\mathbf{R}, E\mathbf{R}) \stackrel{\mathrm{def}}{=} V\mathbf{R}. \tag{2.9} \]

The middle equalities in equations (2.8) and (2.9) are established in Wu and Vos [13].
Since these are equal when $D$ is the KL divergence and we consider no other divergence
functions on $\mathcal{R} \times \mathcal{R}$, we will simply write $E\mathbf{R} \in \mathcal{R}$ and $V\mathbf{R} \in \mathbb{R}$ for the KL mean and
variance.

Furthermore, $E\mathbf{R}$ and $V\mathbf{R}$ completely characterize the average behavior of the
$\mathcal{R}$-valued random variable $\mathbf{R}$ relative to any distribution $R \in \mathcal{R}$ because of the relationship

\[ E[D(\mathbf{R}, R)] = D(E\mathbf{R}, R) + V\mathbf{R} \quad \forall R \in \mathcal{R}. \tag{2.10} \]

This means the KL risk for an $\mathcal{R}$-valued random variable $\mathbf{R}$, having any distribution
function, is completely determined by knowing its argmin, $E\mathbf{R} \in \mathcal{R}$, and minimum, $V\mathbf{R} \ge 0$.
When $R = R_0$, equation (2.10) gives the decomposition of the KL risk in terms of bias
and variance. The relationship in (2.10) will not hold for general nonnegative convex
functions $d$. In this paper we only consider KL divergence $D(R_1, R_2)$. Furthermore, a
conditional expectation on $\mathcal{R}$-valued random variables can be defined so that the
following conditional properties hold

\[ E\mathbf{R} = E\,E[\mathbf{R} \mid S], \tag{2.11} \]

\[ V\mathbf{R} = V\,E[\mathbf{R} \mid S] + E[V(\mathbf{R} \mid S)], \tag{2.12} \]

where $S$ could be $\mathcal{R}$-valued but could also be real or other valued since values of $S$ will
only be used to generate sub sigma fields.
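
The decomposition (2.10) can be checked numerically on a finite sample space. The sketch below is only an illustration under stated assumptions: the random distribution takes three hypothetical values (the rows of `values`) with hypothetical weights `w`, and the helper names `kl` and `kl_risk` are introduced here and do not come from the paper.

```python
import numpy as np

def kl(p, q):
    """D(p, q) = sum_y p(y) log(p(y)/q(y)) on a finite sample space, with 0 log 0 = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# Hypothetical random distribution: R takes the rows of `values` with probabilities `w`.
values = np.array([[0.7, 0.2, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.1, 0.8]])
w = np.array([0.5, 0.3, 0.2])

def kl_risk(q):
    """E[D(R, q)] for a fixed distribution q."""
    return sum(wi * kl(r, q) for wi, r in zip(w, values))

kl_mean = w @ values            # ER, the mixture in (2.8)
kl_var = kl_risk(kl_mean)       # VR = E[D(R, ER)], as in (2.9)

q = np.array([0.3, 0.3, 0.4])   # an arbitrary comparison distribution
lhs = kl_risk(q)
rhs = kl(kl_mean, q) + kl_var   # right-hand side of (2.10)
print(lhs, rhs)
```

The two printed values agree up to floating-point error, and the risk at any other `q` is never below `kl_var`, consistent with the mixture being the argmin.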

Theorem 2.2 (Characterization theorem for expected KL divergence on $\mathcal{R}$).
Let $R_0 \in \mathcal{R}$ have support $\mathcal{X}$ and let $\mathbf{R}$ be an $\mathcal{R}$-valued random variable such that the KL
mean $E\mathbf{R}$ and the KL variance $V\mathbf{R}$ exist and are finite. Then for any $R \in \mathcal{R}$ the mean
divergence between $\mathbf{R}$ and $R$ depends only on the KL mean $E\mathbf{R}$ and KL variance $V\mathbf{R}$.
Furthermore, the KL mean and KL variance satisfy the classical conditional equalities
(2.11) and (2.12).

Proof. Equation (2.10) follows from the definition of KL variance and Theorem 5.2 in
[13], which shows that the expected KL loss $E[D(\mathbf{R}, R)]$ from an $\mathcal{R}$-valued random variable
$\mathbf{R}$ to a distribution $R \in \mathcal{R}$ decomposes as

\[ E[D(\mathbf{R}, R)] = E[D(\mathbf{R}, E\mathbf{R})] + D(E\mathbf{R}, R). \tag{2.13} \]

Equation (2.11) follows from the fact that the KL means $E\mathbf{R}$ and $E[\mathbf{R} \mid S]$ have densities
with respect to $\lambda$ and the order of integration can be interchanged. The steps are the
same as those that establish $EX = E\,E[X \mid Y]$ for $\mathbb{R}$-valued random variables $X$ and $Y$.
We rewrite (2.10) as

\[ E[D(\mathbf{R}, R)] - D(E\mathbf{R}, R) = V\mathbf{R}. \tag{2.14} \]

Note that both expectations (with domain $\mathbb{R}$-valued random variables and with domain
$\mathcal{R}$-valued random variables) and the variance depend on the data generation distribution
$R_0$, which can be any point in $\mathcal{R}$ with support $\mathcal{X}$. If this equation holds for random sample
$X_1, \ldots, X_n$ then it also applies to the conditional distribution of $X_1, \ldots, X_n$ given $S = s$

\[ E[D(\mathbf{R}, R) \mid s] - D(E[\mathbf{R} \mid s], R) = V(\mathbf{R} \mid s). \]

Substituting $S$ into the equation above and taking expectation gives

\[ E[D(\mathbf{R}, R)] - E[D(E[\mathbf{R} \mid S], R)] = E[V(\mathbf{R} \mid S)]. \tag{2.15} \]

Substituting $E[\mathbf{R} \mid S]$ into $\mathbf{R}$ in (2.14) and using $E\,E[\mathbf{R} \mid S] = E\mathbf{R}$ gives

\[ E[D(E[\mathbf{R} \mid S], R)] - D(E\mathbf{R}, R) = V(E[\mathbf{R} \mid S]). \tag{2.16} \]

Adding (2.15) to (2.16) and substituting from (2.14) proves (2.12). □

The random variable $\mathbf{R}$ is a distribution function defined on the sample space and it
will be useful to relate $\mathbf{R}$ to a statistic $T$. We define $\mu_T(R) = E_R T \in \mathbb{R}^d$ and when we
consider only one statistic we write $\mu(R) = \mu_T(R)$. The $\mathbb{R}^d$-valued random variable $\mu(\mathbf{R})$
describes the behavior of the $\mathcal{R}$-valued random variable $\mathbf{R}$ and the mean of $\mu(\mathbf{R})$ can be
obtained from the KL mean.

Theorem 2.3 (Expectation property on $\mathcal{R}$). For any statistic $T$ such that $\mu(\mathbf{R}) < \infty$
a.e., the mean of $T$ under $E\mathbf{R}$ equals the mean of the $\mathbb{R}^d$-valued random variable $\mu(\mathbf{R})$

\[ \mu(E\mathbf{R}) = E[\mu(\mathbf{R})]. \tag{2.17} \]

Proof. The density for $E\mathbf{R}$ can be written as $\int r_x(y) \, r_0^n(x) \, d\lambda^n(x)$ so that

\[ \mu(E\mathbf{R}) = \int T(y) \int r_x(y) \, r_0^n(x) \, d\lambda^n(x) \, d\lambda(y)
   = \int \int T(y) \, r_x(y) \, d\lambda(y) \, r_0^n(x) \, d\lambda^n(x) = E[\mu(\mathbf{R})] \]

because the order of integration can be switched. □

2.3. General subspace $\mathcal{P}$

We typically are interested in a subfamily of distributions $\mathcal{P} \subset \mathcal{R}$ and we describe a
distribution in terms of the KL risk $E[D(\mathbf{R}, P)]$ for $P \in \mathcal{P}$. We add the regularity condition
that the support of each distribution in $\mathcal{P}$ is $\mathcal{X}$. Equation (2.10) shows that $E\mathbf{R}$ and $V\mathbf{R}$
give the KL risk for any $P \in \mathcal{P}$. However, generally $E\mathbf{R} \notin \mathcal{P}$ even if $\mathbf{R}$ takes values only
in $\mathcal{P}$. We consider whether an expectation can be defined that takes values in $\mathcal{P}$ and so
that (2.10) holds. We will define this expectation as a minimum over $\mathcal{P}$. We define

\[ V^\dagger \mathbf{R} = \inf_{P \in \mathcal{P}} E[D(\mathbf{R}, P)] \]

and

\[ E^\dagger \mathbf{R} = \arg\min_{P \in \mathcal{P}} E[D(\mathbf{R}, P)] \]

if the minimum exists, in which case

\[ V^\dagger \mathbf{R} = E[D(\mathbf{R}, E^\dagger \mathbf{R})]. \]

Equation (2.10) now becomes

\[ E[D(\mathbf{R}, P)] = D(E^\dagger \mathbf{R}, P) + V^\dagger \mathbf{R} + \Delta(E\mathbf{R}, E^\dagger \mathbf{R}, P) \quad \forall P \in \mathcal{P}, \tag{2.18} \]

where

\[ \Delta(E\mathbf{R}, E^\dagger \mathbf{R}, P) = D(E\mathbf{R}, P) - D(E\mathbf{R}, E^\dagger \mathbf{R}) - D(E^\dagger \mathbf{R}, P). \tag{2.19} \]

If $\Delta$ vanishes for all $P \in \mathcal{P}$ then the argmin $E^\dagger \mathbf{R}$ and the min $V^\dagger \mathbf{R}$ completely
characterize $\mathbf{R}$ in terms of KL risk. When $\Delta$ is small these functions can be used to approximate
the KL risk of $\mathbf{R}$. We will show the term $\Delta$ vanishes when $\mathcal{P}$ is an exponential family.

The relationship between the expectations $E\mathbf{R}$ and $E^\dagger \mathbf{R}$ can be expressed by using the
KL projection onto $\mathcal{P}$

\[ \Pi R = \arg\min_{P \in \mathcal{P}} D(R, P). \]

By equation (2.10),

\[ E^\dagger \mathbf{R} = \Pi E\mathbf{R}. \tag{2.20} \]

For any $\mathcal{P}$, we have that $V\mathbf{R} \le V^\dagger \mathbf{R}$ since $\mathcal{P} \subset \mathcal{R}$. These results are summarized in the
following theorem.

Theorem 2.4. Let $R_0 \in \mathcal{R}$ such that the support of $R_0$ is $\mathcal{X}$ and let $\mathbf{R}$ be an $\mathcal{R}$-valued
random variable such that the distribution mean $E^\dagger \mathbf{R}$ and the distribution variance $V^\dagger \mathbf{R}$
exist and are finite. Then for any $P \in \mathcal{P}$ the mean divergence between $\mathbf{R}$ and $P$ is given
by (2.18). The term $\Delta$ measures the extent to which the KL mean, distribution mean,
and $P$ depart from forming a dual Pythagorean triangle. The KL variance is less than
or equal to the distribution variance, $V\mathbf{R} \le V^\dagger \mathbf{R}$, and the distribution mean is the KL
projection of the KL mean onto $\mathcal{P}$, $E^\dagger \mathbf{R} = \Pi E\mathbf{R}$.

Wu and Vos [13] show that $\Delta = 0$ for all $P \in \mathcal{P}$ when $\mathcal{P}$ is an exponential family. For mixture
families $E\mathbf{R} = E^\dagger \mathbf{R}$. Hence, $\Delta$ vanishes when $\mathcal{P}$ is either an exponential family or mixture
family.

While we don't know how to write $E^\dagger$ as an integral and the expectation property
(2.17) does not hold for $E^\dagger$ in general, we show equations (2.11) and (2.12) hold with $E$
replaced with $E^\dagger$ and $V$ replaced with $V^\dagger$ when $\mathcal{P}$ is either an exponential or mixture
family. Furthermore, the expectation property will hold for $E^\dagger$ when $\mathcal{P}$ is an exponential
family and $T$ is the canonical statistic.
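
To make the relationship between the KL mean and the distribution mean concrete, the following sketch takes the subfamily to be the binomial family on {0, ..., 4} and approximates both argmins on a grid. Everything specific here is an assumption for illustration: the three values of the random distribution, their weights, the grid, and the helper names. The grid is chosen so that it contains the exact minimizer for these numbers.

```python
import numpy as np
from math import comb

n = 4

def binom_pmf(theta):
    """pmf of the n-binomial distribution P_theta on {0, ..., n}."""
    return np.array([comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(n + 1)])

def kl(p, q):
    """D(p, q) = sum p log(p/q), with the convention 0 log 0 = 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# Hypothetical random distribution on {0, ..., 4}; its values lie outside the family.
values = np.array([[0.40, 0.30, 0.15, 0.10, 0.05],
                   [0.05, 0.10, 0.20, 0.30, 0.35],
                   [0.20, 0.20, 0.20, 0.20, 0.20]])
w = np.array([0.5, 0.3, 0.2])
kl_mean = w @ values                                   # KL mean (mixture)

thetas = np.linspace(0.0005, 0.9995, 1000)             # grid over the family
family = [binom_pmf(t) for t in thetas]
risk = np.array([sum(wi * kl(r, p) for wi, r in zip(w, values)) for p in family])
proj = np.array([kl(kl_mean, p) for p in family])

i_dagger = int(np.argmin(risk))    # grid approximation of the distribution mean
i_proj = int(np.argmin(proj))      # grid approximation of the projection of the KL mean
kl_var = sum(wi * kl(r, kl_mean) for wi, r in zip(w, values))   # KL variance
dist_var = risk[i_dagger]                                       # distribution variance

# Correction term (2.19) evaluated at an arbitrary member of the family.
p = binom_pmf(0.8)
delta = kl(kl_mean, p) - kl(kl_mean, family[i_proj]) - kl(family[i_proj], p)
print(thetas[i_dagger], thetas[i_proj], kl_var, dist_var, delta)
```

In this example the two argmins coincide, the KL variance does not exceed the distribution variance, and the correction term is zero up to rounding, as expected for an exponential family.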


3. Exponential family $\mathcal{P}$


For a general subspace $\mathcal{P} \subset \mathcal{R}$ the distribution mean $E^\dagger \mathbf{R}$ and distribution variance
$V^\dagger \mathbf{R}$ do not characterize $E[D(\mathbf{R}, P)]$ for $P \in \mathcal{P}$. However, when $\mathcal{P}$ is an exponential
family these quantities do characterize $E[D(\mathbf{R}, P)]$ and the classical equalities relating
conditional mean and variance hold. A standard reference for exponential families is
Brown [3], but the approach we take here is slightly different since our emphasis is on the
distributions without regard to any particular parameterization. An exponential family
$\mathcal{P}$ will be defined by selecting a point $P_0 \in \mathcal{R}$ and statistic $T(x)$ taking values in $\mathbb{R}^d$. The
defining property of an exponential family is that for any $P \in \mathcal{P}$ the log of the density
of $P$ with respect to $P_0$ is a linear combination of $T(x)$ and the constant function. We
start with some definitions and basic properties.

Definition 3.1. $\mathcal{P}$ is an exponential family on $\mathcal{X}$ if there exists $P_0 \in \mathcal{R}$ such that the
support of $P_0$ is $\mathcal{X}$ and a function $T: \mathcal{X} \mapsto \mathbb{R}^d$ such that for any $P \in \mathcal{P}$

\[ dP \propto e^{\theta' T(x)} \, dP_0 \quad \text{for some } \theta \in \mathbb{R}^d. \]

The distribution $P_0$ is called a base point and $T$ is called the canonical statistic of $\mathcal{P}$.
The canonical parameter space is

\[ \Theta(\mathcal{P}) = \{\theta \in \mathbb{R}^d : \text{for some } P \in \mathcal{P},\ dP \propto e^{\theta' T(x)} \, dP_0\}. \]

Without loss of generality, we can choose a base point $P_0$ such that $P_0 \in \mathcal{P}$. We'll refer
to exponential families using base points that belong to the family.

Definition 3.2. Let $\mathcal{P}$ be an exponential family with base point $P_0$, canonical statistic
$T$, and set $\Theta = \{\theta \in \mathbb{R}^d : \int e^{\theta' T(x)} \, dP_0 < \infty\}$. The cumulant function has domain $\Theta$ and
is defined as

\[ \psi(\theta) = \log \int e^{\theta' T(x)} \, dP_0. \]

The density with respect to $P_0$ for any $P \in \mathcal{P}$ is

\[ \frac{dP}{dP_0} = \exp\{\theta' T(x) - \psi(\theta)\} \quad \text{for some } \theta \in \Theta(\mathcal{P}). \]

The family $\mathcal{P}$ is regular if $\Theta(\mathcal{P})$ is open and $\mathcal{P}$ is full if $\Theta(\mathcal{P}) = \Theta$.

By the factorization theorem, T is sufficient. It will often be useful to restrict the choice 
of T so that it is complete for the full exponential family P. 
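
As a concrete instance of Definitions 3.1 and 3.2, the sketch below writes the binomial family in this form. The choices are assumptions made for illustration: the base point is taken to be Binomial(n, 1/2), the canonical statistic is T(x) = x, and the function names are introduced here. The cumulant function is computed directly from its definition and the resulting density is compared with the usual binomial pmf.

```python
import numpy as np
from math import comb

n = 5
x = np.arange(n + 1)
# Base point P_0 = Binomial(n, 1/2), as a pmf with respect to counting measure.
base = np.array([comb(n, k) for k in range(n + 1)]) / 2.0**n

def psi(theta):
    """Cumulant function: log of the integral of exp(theta * T) dP_0, with T(x) = x."""
    return float(np.log(np.sum(np.exp(theta * x) * base)))

def density_wrt_base(theta):
    """dP/dP_0 = exp{theta * T(x) - psi(theta)}."""
    return np.exp(theta * x - psi(theta))

theta = 0.7
pmf = density_wrt_base(theta) * base            # pmf of P with respect to counting measure
prob = np.exp(theta) / (1 + np.exp(theta))      # success probability implied by theta
binom = np.array([comb(n, k) * prob**k * (1 - prob)**(n - k) for k in range(n + 1)])
print(np.max(np.abs(pmf - binom)))
```

With this base point the cumulant function has the closed form n log((1 + e^theta)/2), and theta = 0 recovers the base point itself.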

Definition 3.3. A statistic $T$ is complete for $\mathcal{P}$ if

\[ E_P h(T) = 0 \quad \forall P \in \mathcal{P} \implies h(T) = 0 \text{ a.e. } \mathcal{P}. \]

The following theorem shows that the projection operator on $\mathcal{P}$ behaves like the
expectation operator on $\mathcal{R}$ (Theorem 2.3) and will be used to show that the classical
conditional expectation equation holds for $E^\dagger$.

Theorem 3.1 (Projection property on $\mathcal{P}$). If $\Pi$ is the KL projection onto $\mathcal{P}$, where
$\mathcal{P}$ is an exponential family having canonical statistic $T$ and $\mu(R) = E_R T$, then for any
$R \in \mathcal{R}$ such that $\mu(R) \in \mu(\mathcal{P})$,

\[ \mu(\Pi R) = \mu(R), \tag{3.1} \]

where $\mu(\mathcal{P}) = \{\mu \in \mathbb{R}^d : \text{for some } P \in \mathcal{P},\ \mu = E_P T\}$ is the mean parameter space of $\mathcal{P}$.

Proof. This result follows from the relationship between the natural and expectation
parameters for an exponential family $\mathcal{P}$. Let $\mu_1 = \mu(P_1)$ for some $P_1 \in \mathcal{P}$. Then the
natural parameter $\theta(P_1)$ of this distribution satisfies

\[ \theta(P_1) = \arg\max_{\theta \in \Theta} [\theta' \mu_1 - \psi(\theta)] \tag{3.2} \]

and since $\theta$ parameterizes $\mathcal{P}$,

\[ P_1 = \arg\max_{P \in \mathcal{P}} [\theta(P)' \mu_1 - \psi(\theta(P))]. \tag{3.3} \]

The result now follows for exponential family $\mathcal{P}$ by simple calculation

\[ \Pi R_1 = \arg\min_{P \in \mathcal{P}} D(R_1, P) \]
\[ = \arg\min_{P \in \mathcal{P}} (E_{R_1} \log r_1 - E_{R_1} \log p) \]
\[ = \arg\max_{P \in \mathcal{P}} E_{R_1} \log p \]
\[ = \arg\max_{P \in \mathcal{P}} (\theta(P)' \mu(R_1) - \psi(\theta(P))) \]
\[ = P_1, \]

where $\mu(P_1) = \mu(R_1)$ by (3.3). □

Corollary 3.1 (Pythagorean property for exponential families). Let $\mathcal{P}$ be an
exponential family and let $R \in \mathcal{R}$ such that $\Pi R$ exists. For all $P \in \mathcal{P}$

\[ D(R, P) = D(R, \Pi R) + D(\Pi R, P). \tag{3.4} \]

This is a well-known result. See, for example, [4] or [6].
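
The projection property (3.1) and the Pythagorean property (3.4) can be checked numerically for the binomial family with canonical statistic T(x) = x. In the sketch below the distribution `r` is a hypothetical point outside the family and the helper names are introduced for illustration; the projection is obtained by matching the mean of T, which is what (3.1) requires.

```python
import numpy as np
from math import comb

n = 3
x = np.arange(n + 1)

def binom_pmf(theta):
    """pmf of the n-binomial distribution P_theta on {0, ..., n}."""
    return np.array([comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(n + 1)])

def kl(p, q):
    """D(p, q) = sum p log(p/q), with the convention 0 log 0 = 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

r = np.array([0.1, 0.5, 0.1, 0.3])   # hypothetical distribution outside the binomial family
mu = float(np.sum(x * r))            # mean of T under r
proj = binom_pmf(mu / n)             # family member with the same mean of T, as in (3.1)

gaps = []
for theta in (0.1, 0.35, 0.5, 0.9):
    p = binom_pmf(theta)
    gaps.append(kl(r, p) - kl(r, proj) - kl(proj, p))   # zero when (3.4) holds
print(gaps)
```

Each gap is zero up to floating-point error, and no other member of the family is closer to `r` in KL divergence than `proj`.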

We define an extended projection $\bar{\Pi} R$ to be any distribution in $\mathcal{R}$ such that expectation
and Pythagorean properties hold and it belongs to the “boundary” of $\mathcal{P}$; that is,

\[ \mu(R) = \mu(\bar{\Pi} R), \tag{3.5} \]

\[ D(R, P) = D(R, \bar{\Pi} R) + D(\bar{\Pi} R, P) \quad \forall P \in \mathcal{P}, \tag{3.6} \]

\[ \inf_{P \in \mathcal{P}} D(\bar{\Pi} R, P) = 0. \]

Note that $\Pi R$ satisfies these three equalities, and that the last two equalities imply

\[ D(R, \bar{\Pi} R) = \inf_{P \in \mathcal{P}} D(R, P). \]

The extended projection allows us to define the extended MLE in the next section.

The Pythagorean property allows us to improve $\mathcal{R}$-valued random variables by the
projection $\Pi$ or, more generally, by $\bar{\Pi}$.

Corollary 3.2 (Projection property for $\mathcal{R}$-valued random variables). If $\bar{\Pi} \mathbf{R}$
exists a.e., then

\[ E[D(\mathbf{R}, P)] \ge E[D(\bar{\Pi} \mathbf{R}, P)] \]

with equality holding if and only if $\bar{\Pi} \mathbf{R} = \mathbf{R}$ a.e.

Proof. Replacing $R$ with $\mathbf{R}$ in equation (3.6) and taking expectations shows

\[ E[D(\mathbf{R}, P)] = E[D(\mathbf{R}, \bar{\Pi} \mathbf{R})] + E[D(\bar{\Pi} \mathbf{R}, P)] \quad \forall P \in \mathcal{P} \]

and the result follows from the fact that $E[D(\mathbf{R}, \bar{\Pi} \mathbf{R})] \ge 0$ with equality holding if and
only if $\mathbf{R} = \bar{\Pi} \mathbf{R}$ a.e. □

3.2. Fundamental equations for distribution mean and variance 

For exponential families, the distribution expectation and variance have the same
properties as the KL expectation and variance. One distinction is that the expectation
property of $E$ holds for any statistic while for $E^\dagger$ the expectation property holds only for the
canonical statistic $T$.

Theorem 3.2 (Characterization of expected KL divergence on $\mathcal{P}$). Let $R_0 \in \mathcal{R}$
have support $\mathcal{X}$ and let $\mathbf{R}$ be an $\mathcal{R}$-valued random variable such that the distribution mean
$E^\dagger \mathbf{R}$ exists and the distribution variance $V^\dagger \mathbf{R}$ is finite. Then for any $P \in \mathcal{P}$, where $\mathcal{P}$
is an exponential family, the mean KL divergence between $\mathbf{R}$ and $P$ depends only on the
distribution mean and distribution variance

\[ E[D(\mathbf{R}, P)] = D(E^\dagger \mathbf{R}, P) + V^\dagger \mathbf{R} \quad \forall P \in \mathcal{P}. \tag{3.7} \]

Assuming the conditional expectations and variances exist, the distribution mean and
distribution variance satisfy the classical conditional equalities

\[ E^\dagger \mathbf{R} = E^\dagger E^\dagger[\mathbf{R} \mid S], \tag{3.8} \]

\[ V^\dagger \mathbf{R} = V^\dagger E^\dagger[\mathbf{R} \mid S] + E[V^\dagger(\mathbf{R} \mid S)], \tag{3.9} \]

where $S$ is a real-valued random vector. Furthermore, the expectation property holds for
the canonical statistic $T$

\[ \mu(E^\dagger \mathbf{R}) = E[\mu(\mathbf{R})]. \tag{3.10} \]

Proof. By Corollary 3.1 and equation (3.1) the correction term (2.19) vanishes showing
that equation (3.7) holds. Equation (3.10) follows from

\[ \mu(E^\dagger \mathbf{R}) = \mu(E\mathbf{R}) \tag{3.11} \]

and the expectation property on $\mathcal{R}$ (2.17). Equation (3.11) follows from the (extended)
projection property for exponential families (3.1) and (3.5) and the relationship between
$E$ and $E^\dagger$ (2.20). Now equation (3.8) follows from

\[ \mu(E^\dagger E^\dagger[\mathbf{R} \mid S]) = E[\mu(E^\dagger[\mathbf{R} \mid S])] \]
\[ = E[\mu(E[\mathbf{R} \mid S])] \]
\[ = \mu(E\,E[\mathbf{R} \mid S]) \]
\[ = \mu(E\mathbf{R}) \]
\[ = \mu(E^\dagger \mathbf{R}), \]

where the first equality follows from (3.10), the second and fifth equalities follow from
(3.11), the third equality follows from the expectation property of the KL mean on $\mathcal{R}$,
and the fourth equality follows from the conditional expectation property that holds on
$\mathcal{R}$ (2.11). Equation (3.9) follows by the same steps that justified (2.12). We rewrite
(3.7) as

\[ E[D(\mathbf{R}, P)] - D(E^\dagger \mathbf{R}, P) = V^\dagger \mathbf{R}. \tag{3.12} \]

If this equation holds for random sample $X_1, \ldots, X_n$ then it also applies to the conditional
distribution of $X_1, \ldots, X_n$ given $S = s$

\[ E[D(\mathbf{R}, P) \mid s] - D(E^\dagger[\mathbf{R} \mid s], P) = V^\dagger(\mathbf{R} \mid s). \]

Substituting $S$ into the equation above and taking expectation gives

\[ E[D(\mathbf{R}, P)] - E[D(E^\dagger[\mathbf{R} \mid S], P)] = E[V^\dagger(\mathbf{R} \mid S)]. \tag{3.13} \]

Substituting $E^\dagger[\mathbf{R} \mid S]$ into $\mathbf{R}$ in (3.12) and using $E^\dagger E^\dagger[\mathbf{R} \mid S] = E^\dagger \mathbf{R}$ gives

\[ E[D(E^\dagger[\mathbf{R} \mid S], P)] - D(E^\dagger \mathbf{R}, P) = V^\dagger E^\dagger[\mathbf{R} \mid S]. \tag{3.14} \]

Adding (3.13) to (3.14) and substituting from (3.12) proves (3.9). □


4. Rao-Blackwell and the MLE as the unique
UMV$^\dagger$U distribution estimator

An immediate corollary to the characterization theorem on $\mathcal{P}$ (equations (3.7), (3.8), and
(3.9)) is that for any random distribution $\mathbf{R}$ and any statistic $S$, the random distribution
$E^\dagger[\mathbf{R} \mid S]$ will have the same distribution mean and have distribution variance less than
or equal to that of $\mathbf{R}$. If $S = T$ is sufficient then $E^\dagger[\mathbf{R} \mid T]$ is an estimator and if $T$ is
also complete $E^\dagger[\mathbf{R} \mid T]$ will have smaller variance than $\mathbf{R}$ unless they are equal with
probability one. This conditional expectation is enough to establish a Rao-Blackwell
result for distribution estimators if these were restricted to $\mathcal{P}$. However, since we are
allowing $\mathcal{R}$-valued estimators we also need to project the distributions onto $\mathcal{P}$ using $\Pi$.

For an exponential family $\{P(y; \tau)\}$ having mean parameter $\tau \in \mu(\mathcal{P}) = M$ and discrete
sample space we typically have that $\Pr(T \in M) < 1$ while $\Pr(T \in \bar{M}) = 1$ where $\bar{M}$ is the
closure of $M$. In this case, the MLE does not always exist. However, the characterization
theorem applies to $\mathcal{R}$-valued estimators so we can define an estimator that equals the
MLE $P(y; t)$ when it exists and as a distribution $\bar{P}(y; t)$ such that $\mu(\bar{P}(y; t)) = t$ and
$\inf_{P \in \mathcal{P}} D(\bar{P}(y; t), P) = 0$ if $t \notin M$. The extended MLE as distribution estimator is

\[ P^*(y; t) = \begin{cases} P(y; t) & \text{if } t \in M, \\ \bar{P}(y; t) & \text{if } t \notin M. \end{cases} \]

Unbiasedness of $P^*$ follows from the following theorem.

Theorem 4.1 (Distribution unbiased estimators in exponential families). Let
$\mathcal{P}$ be an exponential family with complete sufficient statistic $T$ and let $\mathbf{R}$ be an $\mathcal{R}$-valued
random variable. The estimator $\mathbf{R}$ is distribution unbiased for $P_0 = \Pi R_0$ if and only if
$\mu(E[\mathbf{R} \mid T]) = T$ a.e.

Proof. We must show $\Pi E\mathbf{R} = P_0$ for all $P_0 \in \mathcal{P}$ if and only if $\mu(E[\mathbf{R} \mid T]) = T$ a.e. for all
$P_0 \in \mathcal{P}$. Consider the following equivalences, each of which holds for all $P_0 \in \mathcal{P}$:

\[ \Pi E\mathbf{R} = P_0 \]
\[ \mu(\Pi E\mathbf{R}) = \mu(P_0) \]
\[ \mu(E\mathbf{R}) = \mu(P_0) \]
\[ \mu(E\,E[\mathbf{R} \mid T]) = \mu(P_0) \]
\[ E[\mu(E[\mathbf{R} \mid T])] = \mu(P_0) \]
\[ E[\mu(E[\mathbf{R} \mid T])] = E(T). \]

The first equivalence follows because the expectation of $T$ parameterizes $\mathcal{P}$, the
second equivalence follows from the projection property for exponential families, the third
equivalence follows from the conditional expectation defined for the KL mean, the fourth
equivalence follows from the expectation property for the KL mean, and the fifth
equivalence follows from the definition of the function $\mu$. Clearly, $\mu(E[\mathbf{R} \mid T]) = T$ a.e. implies
the last equality. Since $T$ is complete and the last equality holds for all $P_0 \in \mathcal{P}$, this
implies

\[ \mu(E[\mathbf{R} \mid T]) = T \quad \text{a.e.} \]
□

Theorem 4.2 (Optimality of the MLE for exponential families). Let $X_1, \ldots, X_n$
be i.i.d. from a distribution $R_0 \in \mathcal{R}$ such that the support of $R_0$ is $\mathcal{X}$. Let $\mathcal{P}$ be an
exponential family with complete sufficient statistic $T$ such that $\mu(R_0) \in \mu(\mathcal{P})$. If $\hat{P}$ is
the MLE or an extended MLE that exists a.e., then $\hat{P}$ is distribution unbiased for
$\Pi R_0$ and it is the unique uniformly minimum distribution variance estimator among all
$\mathcal{R}$-valued estimators that are distribution unbiased for $\Pi R_0$ and for which the extended
projection $\bar{\Pi} \mathbf{R}$ exists a.e.

Proof. Uniqueness and uniform minimum distribution variance follow from the
projection property for $\mathcal{R}$-valued random variables, the characterization theorem on $\mathcal{P}$
described above, and the unbiasedness from Theorem 4.1. □


5. Examples 

5.1. Binomial distribution 


We consider the number of events or “successes” in $n$ trials. The sample space is

\[ \mathcal{X} = \{0, 1, 2, \ldots, n\}. \]

Under the assumptions that these trials are independent and each trial has the same
success probability $0 < \theta < 1$, the distribution of $X$ belongs to the $n$-binomial family

\[ \mathcal{P} = \left\{ P \in \mathcal{R} : P(x) = P_\theta(x) = \binom{n}{x} \theta^x (1 - \theta)^{n - x} \text{ for some } 0 < \theta < 1 \right\}. \]

The MLE for the parameter $\theta$ is $\hat{\theta} = x/n$ for $x \notin \{0, n\}$ but is undefined otherwise. The
extended MLE (it will correspond in a natural way to the extended MLE distribution
estimator) is $\hat{\theta} = x/n$ for all $x \in \mathcal{X}$ and it is unbiased for $\theta$. However, it is not unbiased for
other parameterizations such as the odds $\nu = \theta/(1 - \theta)$, or the log odds $\gamma = \log \nu$. When
viewed as a distribution, that is, $P_{\hat{\theta}}(x)$, equivalently, $P_{\hat{\nu}}(x)$ or $P_{\hat{\gamma}}(x)$ (where we allow
the odds $\nu$ and log odds $\gamma$ to take values in the extended reals), the MLE is the unique
uniformly minimum distribution variance unbiased estimator. As is common practice, we
have used the same notation $\hat{\theta}$ for both the MLE and the extended MLE.

Estimators, whether real-valued or distribution-valued, are functions with domain $\mathcal{X}$.
For the $n$-binomial family an estimator is given by a sequence of $n + 1$ values, real numbers
for $\hat{\theta}$ and probability distributions for $P_{\hat{\theta}}$. For $\hat{\theta}$, we have the sequence

\[ \frac{0}{n}, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n - 1}{n}, \frac{n}{n}. \tag{5.1} \]

Let $P_{\theta_0}$ be a distribution in $\mathcal{P}$. If probabilities of $P_{\theta_0}$ are used to assign weights to the
values in (5.1), then the real number that is closest to the weighted values of (5.1) is $\theta_0$.
That is,

\[ \theta_0 = \arg\min_{\theta \in (0,1)} E\left(\frac{X}{n} - \theta\right)^2. \]

By the Rao-Blackwell theorem, for any other sequence of $n + 1$ real numbers

\[ y(0), y(1), y(2), \ldots, y(n - 1), y(n) \tag{5.2} \]

that satisfy

\[ \theta_0 = \arg\min_{\theta \in (0,1)} E(y(X) - \theta)^2, \]

the realized minimum will be greater than the minimum obtained using the values in
(5.1) unless the sequences are equal, $y(x) = x/n$ for $x \in \{0, 1, 2, \ldots, n\}$.

A distribution estimator $P_{\hat{\theta}}$ obtained from the real valued estimator given in (5.1) can
be defined as

\[ I_0(x), P_{1/n}(x), P_{2/n}(x), \ldots, P_{(n-1)/n}(x), I_n(x), \tag{5.3} \]

where $I_a$ is the indicator function for its subscript; that is, the degenerate distribution
putting all mass on 0 or $n$. Since $\inf_{P \in \mathcal{P}} D(I_a, P) = 0$ it is easily checked that $\bar{\Pi} I_a = I_a$
which means that the sequence in (5.3) is the extended MLE $P^*$. Hence, $P^* = P_{\hat{\theta}}$. Again,
we let $P_{\theta_0}$ be any distribution in $\mathcal{P}$. If $P_{\theta_0}$ is used to assign weights to the distributions in
(5.3), then the distribution in $\mathcal{P}$ that is closest to the weighted average of the distributions
in (5.3) is $P_{\theta_0}$. That is,

\[ P_{\theta_0} = \arg\min_{P \in \mathcal{P}} E[D(P_{\hat{\theta}}, P)]. \]

By the distribution version of the Rao-Blackwell theorem (Theorem 4.2) for any
estimator $\tilde{\theta}$, expressed as a distribution estimator,

\[ P_{\tilde{\theta}(0)}, P_{\tilde{\theta}(1)}, \ldots, P_{\tilde{\theta}(n)} \tag{5.4} \]

that satisfies

\[ P_{\theta_0} = \arg\min_{P \in \mathcal{P}} E[D(P_{\tilde{\theta}}, P)], \]

the realized minimum will be greater than that of the MLE (5.3) unless the two sequences
of functions (5.3) and (5.4) are equal. Theorem 4.2 provides a stronger result than this
since the distributions need not belong to $\mathcal{P}$. In the class of all distribution unbiased
estimators of the form

\[ R_0(x), R_1(x), R_2(x), \ldots, R_{n-1}(x), R_n(x) \]

for which the extended projection $\bar{\Pi}$ exists, the MLE (5.3) has smallest distribution
variance. In the Hardy-Weinberg model estimators that do not belong to the family $\mathcal{P}$
have been suggested. We consider the details in Section 5.2.
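
The unbiasedness statement for the sequence (5.3) can be illustrated numerically. The sketch below is an illustration only: n, the true value theta0, the grid, and the competing shrinkage estimator (x + 1)/(n + 2) are all assumptions chosen for the example, and the argmin over the family is approximated on a grid that contains theta0.

```python
import numpy as np
from math import comb

n, theta0 = 5, 0.3

def binom_pmf(theta):
    """pmf of P_theta on {0, ..., n}; theta = 0 or 1 gives the degenerate limits."""
    return np.array([comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(n + 1)])

def kl(p, q):
    """D(p, q) = sum p log(p/q), with the convention 0 log 0 = 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

weights = binom_pmf(theta0)                          # sampling distribution of X
mle = [binom_pmf(k / n) for k in range(n + 1)]       # the sequence (5.3)
shrunk = [binom_pmf((k + 1) / (n + 2)) for k in range(n + 1)]   # hypothetical competitor

thetas = np.linspace(0.01, 0.99, 99)                 # grid over the family

def dist_mean(estimator):
    """Grid argmin over the family of E[D(estimator(X), P_theta)]."""
    risk = [sum(wt * kl(r, binom_pmf(t)) for wt, r in zip(weights, estimator))
            for t in thetas]
    return float(thetas[int(np.argmin(risk))])

print(dist_mean(mle), dist_mean(shrunk))
```

For the extended MLE the minimizing member of the family sits at theta0, whereas for the shrinkage estimator it sits near (n theta0 + 1)/(n + 2), so that estimator is not distribution unbiased in this example.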

The choice of the n-binomial model P was based on the assumptions that the data 
represented independent and identical trials. If either of these assumptions were grossly 
violated, the binomial model would not be appropriate. However, this model can be used 
when these assumptions hold approximately in the sense that there is a distribution 
Po = HPq in P that is close to the data generation distribution Rq, that is, D{Ro,Po) is 
small. In this case, the MLE is the unique UMVtU estimator for Pq. 
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5.2. Hardy—Weinberg model 

For a single pair of alleles A and a, which occur with probabilities $\theta$ and $(1 - \theta)$ for
$\theta \in (0,1)$, the Hardy-Weinberg (HW) model defines the relative frequency of genotypes
AA, Aa, and aa to be $\pi_1(\theta) = \theta^2$, $\pi_2(\theta) = 2\theta(1 - \theta)$, and
$\pi_3(\theta) = (1 - \theta)^2$. For this example, we can take $\mathcal{R}$ to be the collection of
trinomial models with probabilities $(\pi_1, \pi_2, \pi_3)$ for $\pi_1 + \pi_2 + \pi_3 = 1$, which can be
represented by the simplex in 2-dimensional space. See Figure 1 for the simplex. The open
circles in Figure 1 are the extended MLE $(\hat\pi_1, \hat\pi_2, \hat\pi_3) = (Y_1, Y_2, Y_3)/n$ for the
trinomial with $n = 6$ trials, where $Y_1$ and $Y_2$ are the counts for AA and Aa. The solid
curve in the simplex is the HW model

$$\mathcal{P} = \{(\pi_1, \pi_2, \pi_3) : \pi_1 = \theta^2,\ \pi_2 = 2\theta(1 - \theta),\ \pi_3 = (1 - \theta)^2\}$$

which is a one-dimensional exponential family with canonical sufficient statistic
$T = 2Y_1 + Y_2$ and canonical natural parameter $\log(\theta/(1 - \theta))$. Chow and Fong [5] find the
UMVU for $\pi_1$ and $\pi_3$ using

$$E_\theta[(\hat\pi_1 - \theta^2)^2] + E_\theta[(\hat\pi_3 - (1 - \theta)^2)^2]$$

as squared-error loss. They show the UMVU is inadmissible by exhibiting a dominating 
estimator. Both the UMVU and the dominating estimator take values outside the HW 
model. In terms of distribution estimators, these are $\mathcal{R}$-valued estimators.


Figure 1. A Hardy-Weinberg (HW) model with $n = 6$ trials. The simplex represents the
trinomial model space on $(\pi_1, \pi_2, \pi_3)$ for $\pi_1 + \pi_2 + \pi_3 = 1$, while the solid curve is the
HW model space on $\pi_1(\theta) = \theta^2$, $\pi_2(\theta) = 2\theta(1 - \theta)$, and $\pi_3(\theta) = (1 - \theta)^2$ for
$0 < \theta < 1$. The open circles represent the (extended) MLE under the trinomial model
$(\hat\pi_1, \hat\pi_2, \hat\pi_3) = (y_1, y_2, y_3)/n$, and the solid dots are the (extended) MLE under the
HW model $\hat\theta = (2y_1 + y_2)/(2n)$. The dashed curve shows the KL mean of the HW MLE for
each value of $\theta$.

The extended MLE for the HW model is $\hat\theta = (2Y_1 + Y_2)/(2n)$ while the extended
distribution MLE is $P_{\hat\theta}$, where $P_0$ is the degenerate distribution putting all its mass on
$(0, 0, 6)$ (the lower left vertex) and $P_1$ is the degenerate distribution putting all its mass
on $(6, 0, 0)$ (the lower right vertex). The extended HW MLE is represented by the solid
dots in Figure 1.

Among the difficulties with the UMVU estimator and the dominating estimator is that
there are other ways to define squared-error loss (using one bin or two other bins). These
are avoided by using KL divergence. Since $\mathcal{P}$ is an exponential family, the extended
MLE is the UMV$^\dagger$U estimator not only among all $\mathcal{P}$-valued estimators but also among
all $\mathcal{R}$-valued estimators, since the projection exists for all points in the simplex other
than the two lower vertices, which satisfy the extended projection. As a comparison, the
KL mean, represented by the dashed curve in Figure 1, lives outside the model, so the
extended MLE is not KL unbiased. This is due to the curvature in the exponential family.
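
As a rough numerical illustration of the projection argument (a sketch that is not part of the
original paper), the snippet below projects a trinomial point onto the HW curve by matching
the expectation of the canonical statistic $2Y_1 + Y_2$, which gives $\theta = \pi_1 + \pi_2/2$, and
then checks the Pythagorean relationship for KL divergence. The function names `hw`, `kl`
and `project_hw`, and the chosen test point, are illustrative assumptions.

```python
import math

def hw(theta):
    """HW genotype probabilities (AA, Aa, aa) for allele frequency theta."""
    return (theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2)

def kl(r, p):
    """KL divergence D(r, p) on a finite sample space, with 0 * log(0 / x) = 0."""
    return sum(ri * math.log(ri / pi) for ri, pi in zip(r, p) if ri > 0)

def project_hw(r):
    """KL projection of a trinomial point r onto the HW curve.

    Assumption: the projection matches the mean of the canonical
    statistic 2*Y1 + Y2, i.e. theta = r1 + r2 / 2.
    """
    return hw(r[0] + r[1] / 2)

r = (0.5, 0.2, 0.3)          # an arbitrary point in the interior of the simplex
proj = project_hw(r)         # HW distribution with theta = 0.6
p = hw(0.35)                 # any other member of the HW family

lhs = kl(r, p)
rhs = kl(r, proj) + kl(proj, p)

# Brute-force check that the projection really minimises D(r, .) over the curve.
thetas = [j / 1000 for j in range(1, 1000)]
best_theta = min(thetas, key=lambda t: kl(r, hw(t)))

print(lhs, rhs, best_theta)
```

The two printed divergences should agree up to rounding, and the grid minimiser should
coincide with $\pi_1 + \pi_2/2$ for the chosen point.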

5.3. Poisson distribution 

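The Poisson family of distributions is

$$\mathcal{P} = \left\{ P \in \mathcal{R} : P_\lambda(x) = e^{-\lambda} \frac{\lambda^x}{x!} \text{ for some } \lambda > 0 \right\}$$

where $x \in \mathcal{X} = \{0, 1, 2, \ldots\}$.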

Let $X_1, \ldots, X_n$ be a simple random sample from a Poisson distribution $P_{\lambda_0}$. The sum
$S_n = X_1 + \cdots + X_n$ is a complete sufficient statistic of the family. Although the Poisson
family is typically parametrized by a single parameter, we consider estimates for the
probability $\Pr(X_1 = i) = \lambda_0^i e^{-\lambda_0}/i!$ for some $i = 0, 1, \ldots$. A crude but unbiased estimator
is

$$\delta_{0i} = \begin{cases} 1 & \text{if } X_1 = i, \\ 0 & \text{otherwise.} \end{cases}$$

Since, given the sum $S_n$, $X_1$ is distributed as a binomial$(S_n, 1/n)$ random variable, the
Rao-Blackwell theorem shows that

$$\delta_{1i} = \begin{cases} \binom{S_n}{i} \left(\frac{1}{n}\right)^i \left(1 - \frac{1}{n}\right)^{S_n - i} & \text{if } 0 \le i \le S_n, \\ 0 & \text{otherwise,} \end{cases}$$

is an unbiased estimator of $\Pr(X_1 = i)$. Since $\delta_{1i}$ depends on the complete sufficient
statistic $S_n$ only, it must be the unique MVUE of $\Pr(X_1 = i)$. Using the criterion of distribution
unbiasedness, these anomalous estimators do not arise. Since $S_n$ is the canonical statistic,
the MLE $\hat\lambda = S_n/n$ is the unique UMVU estimator for $\lambda$ and the extended distribution
MLE $P_{\hat\lambda}$ is the UMV$^\dagger$U estimator for $P_\lambda$, where $P_{\hat\lambda}$ is $I_0$ when $\hat\lambda = 0$.
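
The unbiasedness of the Rao-Blackwellized estimator can be checked numerically. The
following is a minimal sketch, assuming the estimator is the binomial$(S_n, 1/n)$ probability
of the value $i$ as the conditional distribution above implies; the helper names
`poisson_pmf`, `rb_estimate` and `expected_rb`, and the truncation point `s_max`, are
illustrative choices rather than anything taken from the paper.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability, computed on the log scale to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def rb_estimate(s, i, n):
    """Binomial(s, 1/n) probability of i: the conditional mean of 1{X1 = i} given S_n = s."""
    if i < 0 or i > s:
        return 0.0
    return math.comb(s, i) * (1 / n) ** i * (1 - 1 / n) ** (s - i)

def expected_rb(i, n, lam, s_max=200):
    """E[rb_estimate(S_n, i, n)] for S_n ~ Poisson(n * lam), truncating the sum at s_max."""
    return sum(poisson_pmf(s, n * lam) * rb_estimate(s, i, n)
               for s in range(s_max + 1))

# The expectation should reproduce the target Poisson probability Pr(X1 = i).
i, n, lam = 2, 5, 1.3
print(expected_rb(i, n, lam), poisson_pmf(i, lam))
```

The two printed values should agree up to rounding and truncation error for moderate
values of $n\lambda$.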

To show how the UMVU estimator can fail completely, Lehmann [11] considers the
parameter $\delta = (P(X = 0))^3$ for $n = 1$. In this case, the unique UMVU estimator is $(-2)^X$.
Since the sample consists of nonnegative integers, this estimator is represented by the
following sequence of real numbers

$$1, -2, 4, -8, 16, \ldots.$$

Parametric unbiasedness means that if the Poisson distribution that assigns probability
$\delta^{1/3}$ to $P(X = 0)$ is used to assign probability to the terms in the sequence then
$\delta = \arg\min_{a \in \mathbb{R}} E((-2)^X - a)^2$. That is, the parameter is the real number that is closest to
this sequence in terms of mean square error. In addition, the weighted average of the
above sequence is $\delta$.
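
A short numerical check makes the anomaly concrete. The sketch below, which is an
illustration and not part of the original paper, computes the Poisson-weighted average of
the sequence $(-2)^x$ and compares it with $e^{-3\lambda}$, the cube of $P(X = 0) = e^{-\lambda}$. The
names `poisson_pmf` and `expected_umvu` and the truncation point `k_max` are assumptions
made for the example.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability, computed on the log scale to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def expected_umvu(lam, k_max=150):
    """E[(-2)^X] for X ~ Poisson(lam), truncating the series at k_max."""
    return sum((-2.0) ** k * poisson_pmf(k, lam) for k in range(k_max + 1))

# First few values taken by the estimator: none other than the first lies in [0, 1].
values = [(-2) ** x for x in range(5)]

lam = 1.0
print(values)
print(expected_umvu(lam), math.exp(-3 * lam))
```

Although the weighted average matches the target, every individual estimate other than
the one at $x = 0$ falls outside the interval $[0, 1]$ in which the parameter must lie.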

By focusing on distributions rather than the parameters that name the distributions 
these problems are avoided. The MLE, as a distribution estimator, is represented by the 
following sequence of probability distributions 


$$I_0(x),\ e^{-1}\frac{1^x}{x!},\ e^{-2}\frac{2^x}{x!},\ e^{-3}\frac{3^x}{x!},\ \ldots$$

Distribution unbiasedness means that if the Poisson distribution $P_\lambda$ is used to assign
probability to the terms in the sequence then

$$P_\lambda = \arg\min_{P \in \mathcal{P}} E[D(P_{\hat\lambda}, P)].$$

That is, the distribution that generates the data is the distribution in the exponential
family that is closest to this sequence in terms of KL risk. Any other sequence of
distributions with this property will have greater distribution variance.
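
Distribution unbiasedness of the Poisson MLE can also be examined numerically. The
sketch below is an illustration under stated assumptions, not the authors' code: it uses
the closed form $D(P_a, P_b) = a \log(a/b) - a + b$ for two Poisson distributions (with
$D(I_0, P_b) = b$), averages it over $S_n \sim$ Poisson$(n\lambda)$, and searches a grid of candidate
means for the minimiser. The names `kl_poisson` and `kl_risk`, the grid, and the truncation
point `s_max` are hypothetical choices.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability, computed on the log scale to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def kl_poisson(a, b):
    """KL divergence D(P_a, P_b) between Poisson distributions; P_0 is the point mass at 0."""
    if a == 0:
        return b
    return a * math.log(a / b) - a + b

def kl_risk(mu, lam, n, s_max=200):
    """E[D(P_{S_n / n}, P_mu)] when S_n ~ Poisson(n * lam), truncating the sum at s_max."""
    return sum(poisson_pmf(s, n * lam) * kl_poisson(s / n, mu)
               for s in range(s_max + 1))

lam, n = 1.5, 4
grid = [0.5 + 0.01 * j for j in range(301)]   # candidate means from 0.5 to 3.5
best_mu = min(grid, key=lambda mu: kl_risk(mu, lam, n))

print(best_mu)                  # grid point closest to lam
print(kl_risk(best_mu, lam, n)) # approximate minimised KL risk
```

The minimiser over the Poisson family should land on the grid point nearest the true mean,
and the minimised risk corresponds to the distribution variance of the MLE at that $\lambda$.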


6. Discussion 

The distribution version of the Rao-Blackwell theorem (Theorem 4.2) has been developed by
analogy with important properties of mean square error for the parametric version. In
particular, we have used a Pythagorean-type property for two asymmetric distribution-like
functions: the KL divergence $D(\cdot, \cdot)$ and its expectation $E[D(\cdot, \cdot)]$. For exponential
family $\mathcal{P}$, we have

$$D(R, P) = D(R, \Pi R) + D(\Pi R, P) \quad \forall P \in \mathcal{P}$$

while for all $\mathcal{R}$

$$E[D(\hat R, R)] = E[D(\hat R, E\hat R)] + E[D(E\hat R, R)]$$

so that the expectation operator $E$ defined on $\mathcal{R}$-valued random variables for the KL
risk plays the role of the projection operator $\Pi$ for the KL divergence. Each operator
is a map from a more complicated space to a simpler space, $E$ from $\mathcal{R}$-valued random
variables to a distribution in $\mathcal{R}$ and $\Pi$ from distributions in $\mathcal{R}$ to a distribution in
$\mathcal{P}$, that preserve the KL risk and KL divergence, respectively.

The restriction to exponential families is essentially required by the criterion of having
a sufficient statistic of fixed dimension for all sample sizes $n$. Specifically, the
Darmois-Koopman-Pitman theorem, which follows from independent works of Darmois [7],
Koopman [10] and Pitman [12], shows that when only continuous distributions are considered,
the family of distributions of the sample has a sufficient statistic of dimension less than
$n$ if and only if the population distribution belongs to the exponential family. Denny
[8] shows that for a family of discrete distributions, if there is a sufficient statistic for
the sample, then either the family is an exponential family or the sufficient statistic is
equivalent to the order statistics.

The MLE is parameter-invariant which means that the same distribution is named by 
the parametric ML estimate regardless of the parameter chosen to index the family. One 
approach to studying parameter-invariant quantities is to use differential geometry (e.g., 
Amari [1] or Kass and Vos [9]). The parameter-invariant approach does not work well 
for parameter-dependent quantities such as bias and variance of parametric estimators. 
Our approach allows for the definition of parameter-free versions of bias and variance. 
Furthermore, the distribution version of the Rao-Blackwell theorem provides two extensions:
(1) minimum variance is taken over a larger class of estimators that includes estimators
that are not required to take values in the model space $\mathcal{P}$, (2) the true distribution
need not belong to $\mathcal{P}$.

The fact that the MLE is the unique uniformly minimum distribution variance unbiased 
estimator for exponential families distinguishes the MLE from other estimators. This is 
in contrast to asymptotic methods applied to MSE that can be used to show superior 
properties of the MLE but, being asymptotic results, do not apply uniquely to the MLE. 

Asymptotically, MSE and KL risk are the same and the MSE can be viewed as an 
approximation to KL risk for large n. The distribution version of the Rao-Blackwell 
Theorem 4.2 provides support for Fisher’s claim of the superiority of the MLE even in 
small samples.